Sebastian Farquhar

Gram: Assessing sabotage propensities via automated alignment auditing

May 28, 2026

Realistic honeypot evaluations for scheming propensity

May 28, 2026

Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs

Apr 12, 2026

Practical challenges of control monitoring in frontier AI deployments

Dec 15, 2025

An Approach to Technical AGI Safety and Security

Apr 02, 2025

Do Multilingual LLMs Think In English?

Feb 21, 2025

Holistic Safety and Responsibility Evaluations of Advanced AI Models

Apr 22, 2024

Evaluating Frontier Models for Dangerous Capabilities

Mar 20, 2024

Challenges with unsupervised LLM knowledge discovery

Dec 18, 2023

Model evaluation for extreme risks

May 24, 2023